Setup and Recap
We can now have a list of genes ordered according to their evidence for being differentially-expressed. You should have saved a results_TGF_vs_CTR_annotated.csv file in the previous session.
library(tidyverse)
library(DESeq2)
results_tgf <- read_csv("out_data/DESeq_TGF_vs_CTR_annotated.csv")
Rows: 57914 Columns: 10
── Column specification ──────────────────────────────────────────
Delimiter: ","
chr (3): row, SYMBOL, GENENAME
dbl (7): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dds <- readRDS("Robjects/dds.rds")
If not, one can be downloaded from the course website
dir.create("out_data",showWarnings = FALSE)
download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/out_data/DESeq_TGF_vs_CTR_annotated.csv",destfile = "out_data/DESeq_TGF_vs_CTR_annotated.csv")
results_tgf <- read_csv("out_data/DESeq_TGF_vs_CTR_annotated.csv")
download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/Robjects/dds.rds",destfile = "Robjects/dds.rds")
So far we have focused on individual genes that show differences between our conditions of interest. This might be of interest in some studies, but there are various approaches we could use to tell if specific pathways / process can explain the differences between conditions.
Making a heatmap from the differential expression results
A heatmap is a very popular method for visualising gene expression dataset. It shows the expression levels of a set of genes, and how those genes separate the samples in the dataset into groups. Like the PCA plot we used previously, it is an unsupervised method whereby known biological groups are added after the plot has been created.
However, we don’t use the whole expression matrix to construct the heatmap; attempting to do this will quickly cause R to crash. Moreover, the plot wouldn’t be particularly informative as most genes are not expected to change between biological conditions, or won’t even be expressed.
We often use heatmaps to show a very small set of genes. The most statistically-significant genes are a popular choice as they should discriminate well between biological groups.
The first step is to identify the Ensembl IDs of the most significant genes
# slice is a tool to extract numbered rows from a data frame
de_ids <- results_tgf %>%
arrange(padj) %>%
dplyr::slice(1:10) %>%
pull("row")
de_ids
[1] "ENSG00000172061" "ENSG00000182667" "ENSG00000120708"
[4] "ENSG00000187498" "ENSG00000139211" "ENSG00000133110"
[7] "ENSG00000119681" "ENSG00000115414" "ENSG00000060718"
[10] "ENSG00000049540"
The heatmap tool we are going to use requires a count matrix as input, and our count matrices have Ensembl IDs as row identifiers. As with the boxplots and PCA plots from before this visualisation works best for transformed counts.
library(pheatmap)
vsd <- vst(dds)
using 'avgTxLength' from assays(dds), correcting for library size
heatmap_data <- assay(vsd)[de_ids,]
pheatmap(heatmap_data)

A dendrogram is used to show the relationship between samples; whereby samples that have a short “branch” joining them being similar to each other. However, samples being next to each other on the plot does not imply that they are closely related.
The code to generate the heatmap can be modified to include useful sample annotations. The meta data for this dataset is stored in dds but must undergone a bit of modification before we can include it.
## the meta data must be a data frame
samp_anno <- data.frame(colData(dds))
## the rownames must match the columns of the data being plotted
rownames(samp_anno) <- dds$Run
pheatmap(heatmap_data,
annotation_col = samp_anno)

We can do some tidying-up as not all the columns in our annotation data frame need to be displayed
samp_anno <- dplyr::select(samp_anno, Treated, condition)
pheatmap(heatmap_data,
annotation_col = samp_anno)

Exercise:
- Make another heatmap of the dataset, but this time using the 30 genes with the largest absolute fold-change
- The cells in the heatmap are often scaled at a gene-level according to the change from mean rather than absolute value. Look at the help for
pheatmap to see how to change the heatmap
- It would be more informative to have the rows labeled according to the
SYMBOL rather than Ensembl ID. Look at the help for pheatmap; which argument do you need to change? Look back at the code used to create the de_ids variables, and modify to obtain the SYMBOL corresponding to the 30 genes with the largest absolute fold-change.
Making a heatmap from a pre-defined set of genes
The heatmaps we have generated so far, whilst being attractive, do not tell us anything new about the dataset. We specifically picked genes that we know to be differentially-expressed between groups, so shouldn’t be surprised that they separate the samples.
The authors of the original publication had a hypothesis the Extra-Cellular Matrix (ECM) genes would be altered upon TGF treatment. Now that we have associated gene names to our dataset, we can explore this hypothesis. The first step is to determine which genes belong to the pathway using the org.Hs.eg.db package
library(org.Hs.eg.db)
pathway_genes <- AnnotationDbi::select(org.Hs.eg.db,
keys = "GO:0030198",
keytype = "GO",
columns="ENSEMBL") %>% pull(ENSEMBL)
'select()' returned 1:many mapping between keys and
columns
We can follow the same steps as above to create a matrix of counts to be displayed in the heatmap. However, our first attempt to create the matrix produces an error.
heatmap_data <- assay(vsd)[pathway_genes,]
Error in assay(vsd)[pathway_genes, ] : subscript out of bounds
The error is not particularly informative but can be fixed by checking that all the names in pathway_genes actually appear in our dataset. If we try and subset our data according to row names that do not exist, then R will produce an error.
all(pathway_genes %in% rownames(dds))
[1] FALSE
table(pathway_genes %in% rownames(dds))
FALSE TRUE
29 282
We are a little bit closer, but now the pheatmap code produces an error.
pathway_genes <- pathway_genes[pathway_genes %in% rownames(dds)]
heatmap_data <- assay(vsd)[pathway_genes,]
pheatmap(heatmap_data,
annotation_col = samp_anno,scale="row")
pheatmap(heatmap_data,
annotation_col = samp_anno,scale="row")
The error does not occur when the rows are not scaled. So the error must occur during the clustering of rows (genes).
pathway_genes <- pathway_genes[pathway_genes %in% rownames(dds)]
heatmap_data <- assay(vsd)[pathway_genes,]
pheatmap(heatmap_data,
annotation_col = samp_anno)

The error implies that there is a problem with NA values, but upon inspection it doesn’t seem like there are any missing values.
any(is.na(heatmap_data))
[1] FALSE
However, we do have some rows which contain all the same values. This causes an issue, as scaling the rows involves dividing gene counts by their standard deviation. This causes NA values to appear.
rowVars(heatmap_data)
We can remove rows with 0 variance with some confidence and proceed to make the heatmap.
heatmap_data <- heatmap_data[rowVars(heatmap_data) >0, ]
pheatmap(heatmap_data,
annotation_col = samp_anno,scale="row")

On inspecting the heatmap it does appear that the genes we have selected can distinguish TGF from CTRL samples.
We can also use a volcano plot to see how differentially-expressed the genes from this pathway are in the context of the whole dataset. The technique is labeling the genes is similar to the previous workshop. Since ggplot2 can assign plot aesthetics based on our data frame, we need to introduce a column that can distinguish genes belonging to our pathway of interest.
## We choose to change the alpha parameter so that the colour of ECM stands out
results_tgf %>%
mutate(ECM_Gene = row %in% pathway_genes,"") %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj), col=ECM_Gene,alpha=ECM_Gene)) + geom_point() + scale_color_manual(values = c("black","red")) + scale_alpha_manual(values=c(0.1,1))
Warning: Removed 39184 rows containing missing values (geom_point).

So it does are least appear that some of the most significant genes belong to this pathway.
Exercise: Make a plot to demonstrate whether the genes belonging to the ECM pathway are more significant (as measured by -log\(_{10}\) adjusted p-value) than genes that do not belong to the pathway.

Pathways analysis
In this section we move towards discovering if our results are biologically significant. Are the genes that we have picked statistical flukes, or are there some commonalities.
There are two different approaches one might use, and we will cover the theory behind both.
Threshold-based Gene Set Testing
For a particular pathway we need to calculate how many genes were identified as differentially-expressed and compare to how many we would be expect by chance. Or in other words, if we repeatedly generated a list of differentially-expressed genes at random how many genes from this pathway would be expect to see.
Taking our ECM genes as an example, we can then add new columns in the data frame according to whether a gene belongs to this pathway, and whether it is differentially-expressed.
go_table <- mutate(results_tgf,
inPathway = row %in% pathway_genes,
isDE = padj < 0.05 & abs(log2FoldChange) > 1)
go_table
Cross-tabulating the two new columns gives a basis for a statistical test
table(go_table$inPathway, go_table$isDE)
FALSE TRUE
FALSE 30958 533
TRUE 180 37
The Fisher’s exact test or chi-squared test (as seen here) can then be used
chisq.test(table(go_table$inPathway, go_table$isDE))
Warning in chisq.test(table(go_table$inPathway, go_table$isDE)) :
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity
correction
data: table(go_table$inPathway, go_table$isDE)
X-squared = 279.32, df = 1, p-value < 2.2e-16
In reality it would be impractical to test all possible pathways in this manner, so there are a number of Bioconductor packages that automate the process
Analysis with clusterProfiler
clusterProfiler is a Bioconductor package for over-representation analysis. It’s main advantage is that it provides some nice visualisation methods.
The main function is enrichGO which requires the IDs of genes found to be differentially-expressed (sigGenes) and the IDs of all genes in the dataset (universe). It uses the org.Hs.eg.db package to map between gene names and biological pathways.
library(clusterProfiler)
universe <- results_tgf %>% pull(row)
sigGenes <- results_tgf %>%
filter(padj < 0.05, !is.na(row)) %>% pull(row)
enrich_go <- enrichGO(
gene= sigGenes,
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP",
universe = universe,
qvalueCutoff = 0.05,
readable=TRUE
)
The result of enrichGo can be turned into a data frame for easier interpretation. The data frame ranks the most significant pathways and gives statistics about how many genes from that pathway are found in the list of DE genes and in the background list.
enrich_go %>% data.frame
A dot plot can show us the most enriched pathways, and the size of each.
dotplot(enrich_go,showCategory=20)

A disadvantage of the GO analysis is that it can identify terms with very similar sets of genes that perform similar biological function. We can use an upset plot to understand these overlaps.
enrichplot::upsetplot(enrich_go)

Whilst the over-representation analysis used here has revealed some interesting pathways, it may not work well in all circumstances; especially when very few genes pass the traditional cut-offs. An alternative approach is available under such circumstances.
Gene set enrichment analysis (GSEA)
An appealing feature of the GSEA method is that it does not require us to impose arbitrary cut-offs on the dataset to decide what is differentially-expressed or not. The steps in producing the input required for GSEA are i) retrieving the ranked statistics ii) naming each one according to a chosen identifier (ENSEMBL or ENTREZID for example).
The clusterProfiler package also includes an implementation of the GSEA algorithm, and the function works in much the same way as enrichGO from above.
ranked_genes <- results_tgf %>%
arrange(desc(stat)) %>%
filter(!is.na(stat))
geneList <- pull(ranked_genes, stat)
names(geneList) <- pull(ranked_genes, row)
gse_GO <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "BP",keyType = "ENSEMBL")
preparing geneSet collections...
GSEA analysis...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (11.31% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in fgseaMultilevel(...) :
For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.
leading edge analysis...
done...
gse_GO %>% as.data.frame
An overview of the results can be provided by a “ridge plot”. This allows comparison of the test statistics for each of the top enriched pathways.
## we have changed the fig.width chunk option so that the plot displays properly
ridgeplot(gse_GO)
Picking joint bandwidth of 0.704

An upset plot can still be produced, but this time the distribution of statistics for overlapping categories can be produced.
enrichplot::upsetplot(gse_GO)

The results confirm that the ECM pathway has many differentially-expressed genes (more than we would expect by chance). Moreover, there is a tendancy for these genes to be up-regulated; as indicated by the high positive enrichment score. Another way to visualise the GSEA results, that is typically produced from the GSEA java app, is the so-called enrichment plot. The black vertical lines shown where genes from the pathway are located amongst the ranked gene list. The red dotted line shows where the highest concentration of genes is.
gseaplot(gse_GO,geneSetID = "GO:0030198")

The enrichment plot for a gene set with a high negative enrichment score reveals a different pattern. The interpretation of this results is that most genes in this pathway are down-regulated in TGF-treated samples.
gseaplot(gse_GO,geneSetID = "GO:0002283")

Exercise
- In addition to enriched GO terms,
clusterProfiler can also find enriched KEGG terms using the enrichKEGG function. There are a couple of changes that are required from enrichGO
ENTREZID have to be used as the identifer type
- the user must input an appropriate organism code. The code for humans is
hsa.
- Use the
enrichKEGG function to identify enriched KEGG terms in the analysis.
- (Optional) If you have time, use the gseKEGG to perform GSEA using KEGG terms.
?enrichKEGG
Conclusions
We have now completed the main Bioinformatics workflow for the RNA-seq analysis of our chosen dataset. The workflow has made use of a handful of Bioconductor packages, but also the tidyverse packages that we have used previously. We have spent a lot of time exploring our results, but the key steps are summarised below:-
- Importing the transcript-level counts produced by
tximport into R
- Creating a
DESeq2 dataset from these counts and our meta data
- Quality assessment to check for outliers and verify the relationships between samples
- Differential expression analysis using the
DESeq and results functions to generate gene lists
- Pathways analysis using
clusterProfiler to gain biological insights
---
title: "BMS353 Bioinformatics for Biomedical Science - Week 6"
author: "Module Coordinator Mark Dunning"
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
    css: stylesheets/styles.css
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Learning outcomes

- How can we use a "heatmap" to visualise our differential expression results
- How to explore the differential expression results for a given pathway
- Identify over-represented pathways - given a set of differentially-expressed genes
- Identify pathways with a tendancy to be over- or under-expressed; without having to use an p-value cut-offs 

## Setup and Recap

We can now have a list of genes ordered according to their evidence for being differentially-expressed. You should have saved a `results_TGF_vs_CTR_annotated.csv` file in the previous session. 

```{r}
library(tidyverse)
library(DESeq2)
results_tgf  <- read_csv("out_data/DESeq_TGF_vs_CTR_annotated.csv")
dds <- readRDS("Robjects/dds.rds")
```

If not, one can be downloaded from the course website

```{r eval=FALSE}
dir.create("out_data",showWarnings = FALSE)
download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/out_data/DESeq_TGF_vs_CTR_annotated.csv",destfile = "out_data/DESeq_TGF_vs_CTR_annotated.csv")
results_tgf  <- read_csv("out_data/DESeq_TGF_vs_CTR_annotated.csv")

download.file("https://github.com/sheffield-bioinformatics-core/bms31004_2022/raw/main/Robjects/dds.rds",destfile = "Robjects/dds.rds")

```

So far we have focused on individual genes that show differences between our conditions of interest. This might be of interest in some studies, but there are various approaches we could use to tell if specific pathways / process can explain the differences between conditions.

## Making a heatmap from the differential expression results

A heatmap is a very popular method for visualising gene expression dataset. It shows the expression levels of a set of genes, and how those genes separate the samples in the dataset into groups. Like the PCA plot we used previously, it is an unsupervised method whereby known biological groups are added after the plot has been created.

However, we don't use the whole expression matrix to construct the heatmap; attempting to do this will quickly cause R to crash. Moreover, the plot wouldn't be particularly informative as most genes are not expected to change between biological conditions, or won't even be expressed.

We often use heatmaps to show a very small set of genes. The most statistically-significant genes are a popular choice as they should discriminate well between biological groups.

The first step is to identify the *Ensembl* IDs of the most significant genes

```{r}
# slice is a tool to extract numbered rows from a data frame
de_ids <- results_tgf %>% 
  arrange(padj) %>% 
  dplyr::slice(1:10) %>% 
  pull("row")
de_ids
```
The heatmap tool we are going to use requires a count matrix as input, and our count matrices have Ensembl IDs as row identifiers. As with the boxplots and PCA plots from before this visualisation works best for transformed counts.

```{r}
library(pheatmap)
vsd <- vst(dds)
heatmap_data <- assay(vsd)[de_ids,]
pheatmap(heatmap_data)
```

A *dendrogram* is used to show the relationship between samples; whereby samples that have a short "branch" joining them being similar to each other. However, samples being next to each other on the plot **does not** imply that they are closely related.

The code to generate the heatmap can be modified to include useful sample annotations. The meta data for this dataset is stored in `dds` but must undergone a bit of modification before we can include it.
 
```{r}
## the meta data must be a data frame
samp_anno <- data.frame(colData(dds))
## the rownames must match the columns of the data being plotted
rownames(samp_anno) <- dds$Run
pheatmap(heatmap_data,
         annotation_col = samp_anno)
```

We can do some tidying-up as not all the columns in our annotation data frame need to be displayed

```{r}
samp_anno <- dplyr::select(samp_anno, Treated, condition) 

pheatmap(heatmap_data,
         annotation_col = samp_anno)
```

<div class="exercise">
**Exercise**: 

- Make another heatmap of the dataset, but this time using the 30 genes with the largest absolute fold-change
- The cells in the heatmap are often *scaled* at a gene-level according to the change from mean rather than absolute value. Look at the help for `pheatmap` to see how to change the heatmap
- It would be more informative to have the rows labeled according to the `SYMBOL` rather than Ensembl ID. Look at the help for `pheatmap`; which argument do you need to change? Look back at the code used to create the `de_ids` variables, and modify to obtain the `SYMBOL` corresponding to the 30 genes with the largest absolute fold-change.

</exercise>


## Making a heatmap from a pre-defined set of genes

The heatmaps we have generated so far, whilst being attractive, do not tell us anything new about the dataset. We specifically picked genes that we know to be differentially-expressed between groups, so shouldn't be surprised that they separate the samples. 

The authors of the original publication had a hypothesis the Extra-Cellular Matrix (ECM) genes would be altered upon TGF treatment. Now that we have associated gene names to our dataset, we can explore this hypothesis. The first step is to determine which genes belong to the pathway using the `org.Hs.eg.db` package

```{r}
library(org.Hs.eg.db)
pathway_genes <- AnnotationDbi::select(org.Hs.eg.db,
                                       keys = "GO:0030198",
                                       keytype = "GO",
                                       columns="ENSEMBL") %>% pull(ENSEMBL)
```

We can follow the same steps as above to create a matrix of counts to be displayed in the heatmap. However, our first attempt to create the matrix produces an error.

```{r eval=FALSE}
heatmap_data <- assay(vsd)[pathway_genes,]

```
```
Error in assay(vsd)[pathway_genes, ] : subscript out of bounds
```

The error is not particularly informative but can be fixed by checking that all the names in `pathway_genes` actually appear in our dataset. If we try and subset our data according to row names that do not exist, then R will produce an error.

```{r}
all(pathway_genes %in% rownames(dds))
table(pathway_genes %in% rownames(dds))
```


We are a little bit closer, but now the `pheatmap` code produces an error.

```{r eval=FALSE}
pathway_genes <- pathway_genes[pathway_genes %in% rownames(dds)]

heatmap_data <- assay(vsd)[pathway_genes,]

pheatmap(heatmap_data,
         annotation_col = samp_anno,scale="row")


```


```
pheatmap(heatmap_data,
         annotation_col = samp_anno,scale="row")
```

The error does not occur when the rows are *not scaled*. So the error must occur during the clustering of rows (genes).

```{r}
pathway_genes <- pathway_genes[pathway_genes %in% rownames(dds)]

heatmap_data <- assay(vsd)[pathway_genes,]
pheatmap(heatmap_data,
         annotation_col = samp_anno)
```

The error implies that there is a problem with `NA` values, but upon inspection it doesn't seem like there are any missing values.

```{r}
any(is.na(heatmap_data))
```

However, we do have some rows which contain all the same values. This causes an issue, as scaling the rows involves dividing gene counts by their standard deviation. This causes `NA` values to appear.

```{r}
rowVars(heatmap_data)
```

We can remove rows with 0 variance with some confidence and proceed to make the heatmap.

```{r}
heatmap_data <- heatmap_data[rowVars(heatmap_data) >0, ]

pheatmap(heatmap_data,
         annotation_col = samp_anno,scale="row")
```

On inspecting the heatmap it does appear that the genes we have selected can distinguish TGF from CTRL samples. 

We can also use a volcano plot to see how differentially-expressed the genes from this pathway are in the context of the whole dataset. The technique is labeling the genes is similar to the previous workshop. Since `ggplot2` can assign plot aesthetics based on our data frame, we need to introduce a column that can distinguish genes belonging to our pathway of interest.

```{r}
## We choose to change the alpha parameter so that the colour of ECM stands out
results_tgf %>% 
  mutate(ECM_Gene = row %in% pathway_genes,"") %>% 
  ggplot(aes(x = log2FoldChange, y = -log10(padj), col=ECM_Gene,alpha=ECM_Gene)) + geom_point() + scale_color_manual(values = c("black","red")) + scale_alpha_manual(values=c(0.1,1))
```

So it does are least appear that some of the most significant genes belong to this pathway. 

<div class="exercise">
**Exercise**: Make a plot to demonstrate whether the genes belonging to the ECM pathway are more significant (as measured by -log$_{10}$ adjusted p-value) than genes that do not belong to the pathway.
</div>

![](images/ecm_genes_boxplot.png)

# Pathways analysis

In this section we move towards discovering if our results are ***biologically significant***. Are the genes that we have picked statistical flukes, or are there some commonalities. 

There are two different approaches one might use, and we will cover the theory behind both.

## Threshold-based Gene Set Testing

For a particular pathway we need to calculate how many genes were identified as differentially-expressed and compare to *how many we would be expect by chance*. Or in other words, if we repeatedly generated a list of differentially-expressed genes at random how many genes from this pathway would be expect to see.


Taking our ECM genes as an example, we can then add new columns in the data frame according to whether a gene belongs to this pathway, and whether it is differentially-expressed.

```{r}
go_table <- mutate(results_tgf, 
                   inPathway = row %in% pathway_genes,
                   isDE = padj < 0.05 & abs(log2FoldChange) > 1)
go_table
```

Cross-tabulating the two new columns gives a basis for a statistical test

```{r}
table(go_table$inPathway, go_table$isDE)
```

The Fisher's exact test or chi-squared test (as seen here) can then be used

```{r}
chisq.test(table(go_table$inPathway, go_table$isDE))
```
    
In reality it would be impractical to test all possible pathways in this manner, so there are a number of Bioconductor packages that automate the process


### Analysis with clusterProfiler

`clusterProfiler` is a Bioconductor package for over-representation analysis. It's main advantage is that it provides some nice visualisation methods.

The main function is `enrichGO` which requires the IDs of genes found to be differentially-expressed (`sigGenes`) and the IDs of *all* genes in the dataset (`universe`). It uses the `org.Hs.eg.db` package to map between gene names and biological pathways.

```{r message=FALSE, warning=FALSE}
library(clusterProfiler)
universe <- results_tgf %>% pull(row)
sigGenes <- results_tgf %>% 
  filter(padj < 0.05, !is.na(row)) %>% pull(row)

enrich_go <- enrichGO(
  gene= sigGenes,
  OrgDb = org.Hs.eg.db,
  keyType = "ENSEMBL",
  ont = "BP",
  universe = universe,
  qvalueCutoff = 0.05,
  readable=TRUE
)

```

The result of `enrichGo` can be turned into a data frame for easier interpretation. The data frame ranks the most significant pathways and gives statistics about how many genes from that pathway are found in the list of DE genes and in the background list.

```{r}
enrich_go %>% data.frame
```

A dot plot can show us the most enriched pathways, and the size of each.

```{r}
dotplot(enrich_go,showCategory=20)
```
A disadvantage of the GO analysis is that it can identify terms with very similar sets of genes that perform similar biological function. We can use an [upset plot](https://upset.app/) to understand these overlaps.

```{r}
enrichplot::upsetplot(enrich_go)
```

Whilst the over-representation analysis used here has revealed some interesting pathways, it may not work well in all circumstances; especially when very few genes pass the traditional cut-offs. An alternative approach is available under such circumstances.

## Gene set enrichment analysis (GSEA)

An appealing feature of the **GSEA** method is that it does not require us to impose arbitrary cut-offs on the dataset to decide what is differentially-expressed or not. The steps in producing the input required for GSEA are i) retrieving the ranked statistics ii) naming each one according to a chosen identifier (`ENSEMBL` or `ENTREZID` for example).

The `clusterProfiler` package also includes an implementation of the GSEA algorithm, and the function works in much the same way as `enrichGO` from above.

```{r}
ranked_genes <- results_tgf %>% 
  arrange(desc(stat)) %>% 
  filter(!is.na(stat))
geneList <- pull(ranked_genes, stat)
names(geneList) <- pull(ranked_genes, row)
gse_GO  <- gseGO(geneList = geneList,
        OrgDb = org.Hs.eg.db,
        ont = "BP",keyType = "ENSEMBL")
```

```{r}
gse_GO %>% as.data.frame
```
An overview of the results can be provided by a "ridge plot". This allows comparison of the test statistics for each of the top enriched pathways. 

```{r fig.width=8}
## we have changed the fig.width chunk option so that the plot displays properly
ridgeplot(gse_GO)
```
An upset plot can still be produced, but this time the distribution of statistics for overlapping categories can be produced.

```{r}
enrichplot::upsetplot(gse_GO)
```

The results confirm that the ECM pathway has many differentially-expressed genes (more than we would expect by chance). Moreover, there is a tendancy for these genes to be up-regulated; as indicated by the high positive enrichment score. Another way to visualise the GSEA results, that is typically produced from the GSEA java app, is the so-called enrichment plot. The black vertical lines shown where genes from the pathway are located amongst the ranked gene list. The red dotted line shows where the highest concentration of genes is.

```{r}
gseaplot(gse_GO,geneSetID = "GO:0030198")
```
The enrichment plot for a gene set with a high negative enrichment score reveals a different pattern. The interpretation of this results is that most genes in this pathway are down-regulated in TGF-treated samples.

```{r}
gseaplot(gse_GO,geneSetID = "GO:0002283")
```

## Exercise

<div class="exercise">

- In addition to enriched GO terms, `clusterProfiler` can also find enriched [KEGG](https://www.genome.jp/kegg/) terms using the `enrichKEGG` function. There are a couple of changes that are required from `enrichGO`
  - `ENTREZID` have to be used as the identifer type
  - the user must input an appropriate [organism code](https://www.genome.jp/kegg/catalog/org_list.html). The code for humans is `hsa`.
- Use the `enrichKEGG` function to identify enriched KEGG terms in the analysis.
- (Optional) If you have time, use the gseKEGG to perform GSEA using KEGG terms.

</div>

```{r}
?enrichKEGG
```

## Conclusions

We have now completed the main Bioinformatics workflow for the RNA-seq analysis of our chosen dataset. The workflow has made use of a handful of Bioconductor packages, but also the `tidyverse` packages that we have used previously. We have spent a lot of time exploring our results, but the key steps are summarised below:-

- Importing the transcript-level counts produced by `tximport` into R
- Creating a `DESeq2` dataset from these counts and our *meta data*
- Quality assessment to check for outliers and verify the relationships between samples
- Differential expression analysis using the `DESeq` and `results` functions to generate gene lists
- Pathways analysis using `clusterProfiler` to gain biological insights

